Gross Domestic Product (GDP) is the monetary value of all goods and services produced within a particular geographical region. GDP per capita is the GDP of a region divided by its population. We will use a dataset available on Kaggle that provides the GDP of counties in the United States from 2001 to 2018. This dataset was sourced from BEA.gov, the website of the Bureau of Economic Analysis.
We can combine this dataset with census data to get the population of each county and calculate each county's GDP per capita.
The goal of this tutorial will be to learn about the trends of GDP per capita in different regions in the United States and ultimately create a model that will be able to predict the change in the GDP per capita of a county given its geographical region in the country.
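The per-capita calculation at the heart of this tutorial is just a merge followed by a division. Here is a minimal sketch of that step using two toy DataFrames (the figures mirror the 2001 Autauga and Baldwin rows that appear in the tables below):

```python
import pandas as pd

# Toy GDP and population tables, using the 2001 Alabama figures shown later.
gdp = pd.DataFrame({
    "County": ["Autauga County", "Baldwin County"],
    "GDP": [954_588_000, 3_929_753_000],
})
population = pd.DataFrame({
    "County": ["Autauga County", "Baldwin County"],
    "Population": [44_889, 144_875],
})

# Merge on the shared county name, then divide GDP by population.
merged = gdp.merge(population, on="County")
merged["gdp_per_capita"] = merged["GDP"] / merged["Population"]
print(merged[["County", "gdp_per_capita"]])
```

The real datasets need a few extra keys (state name and year) to make the lookup unambiguous, but the arithmetic is the same.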
First, we will import the Python libraries we will be using during the data analysis process, and then we will load our data from the downloaded CSV file into a Pandas DataFrame.
Once we have our data in a DataFrame, we can begin to take a look at what columns the dataset provides.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn import tree
from sklearn import model_selection
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import plotly.express as px
from urllib.request import urlopen
import json
gdp_data = pd.read_csv("GDP by County.csv")
gdp_data
| index | Year | County FIPS | Region | SUB_REGION | State | STATE_ABBR | County | County Full Name | GDP (Chained $) | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2001 | 1001 | Southeast | East South Central | Alabama | AL | Autauga | Autauga County | 954588000 |
| 1 | 1 | 2001 | 1003 | Southeast | East South Central | Alabama | AL | Baldwin | Baldwin County | 3929753000 |
| 2 | 2 | 2001 | 1005 | Southeast | East South Central | Alabama | AL | Barbour | Barbour County | 813278000 |
| 3 | 3 | 2001 | 1007 | Southeast | East South Central | Alabama | AL | Bibb | Bibb County | 292693000 |
| 4 | 4 | 2001 | 1009 | Southeast | East South Central | Alabama | AL | Blount | Blount County | 810381000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55558 | 55558 | 2018 | 56037 | Rocky Mountain | Mountain | Wyoming | WY | Sweetwater | Sweetwater County | 3836603000 |
| 55559 | 55559 | 2018 | 56039 | Rocky Mountain | Mountain | Wyoming | WY | Teton | Teton County | 2166420000 |
| 55560 | 55560 | 2018 | 56041 | Rocky Mountain | Mountain | Wyoming | WY | Uinta | Uinta County | 906587000 |
| 55561 | 55561 | 2018 | 56043 | Rocky Mountain | Mountain | Wyoming | WY | Washakie | Washakie County | 358104000 |
| 55562 | 55562 | 2018 | 56045 | Rocky Mountain | Mountain | Wyoming | WY | Weston | Weston County | 315885000 |
55563 rows × 10 columns
Next, we will read in the CSV containing population estimates from the US Census. Notice this dataset only contains data for the years 2010 to 2019, so we will have to combine it with another US Census dataset that covers the remaining years.
population_data = pd.read_csv("co-est2019-alldata.csv", encoding="ISO-8859-1")
population_data
| SUMLEV | REGION | DIVISION | STATE | COUNTY | STNAME | CTYNAME | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | ... | RDOMESTICMIG2019 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 | RNETMIG2016 | RNETMIG2017 | RNETMIG2018 | RNETMIG2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 3 | 6 | 1 | 0 | Alabama | Alabama | 4779736 | 4780125 | 4785437 | ... | 1.917501 | 0.578434 | 1.186314 | 1.522549 | 0.563489 | 0.626357 | 0.745172 | 1.090366 | 1.773786 | 2.483744 |
| 1 | 50 | 3 | 6 | 1 | 1 | Alabama | Autauga County | 54571 | 54597 | 54773 | ... | 4.847310 | 6.018182 | -6.226119 | -3.902226 | 1.970443 | -1.712875 | 4.777171 | 0.849656 | 0.540916 | 4.560062 |
| 2 | 50 | 3 | 6 | 1 | 3 | Alabama | Baldwin County | 182265 | 182265 | 183112 | ... | 24.017829 | 16.641870 | 17.488579 | 22.751474 | 20.184334 | 17.725964 | 21.279291 | 22.398256 | 24.727215 | 24.380567 |
| 3 | 50 | 3 | 6 | 1 | 5 | Alabama | Barbour County | 27457 | 27455 | 27327 | ... | -5.690302 | 0.292676 | -6.897817 | -8.132185 | -5.140431 | -15.724575 | -18.238016 | -24.998528 | -8.754922 | -5.165664 |
| 4 | 50 | 3 | 6 | 1 | 7 | Alabama | Bibb County | 22915 | 22915 | 22870 | ... | 1.385134 | -4.998356 | -3.787545 | -5.797999 | 1.331144 | 1.329817 | -0.708717 | -3.234669 | -6.857092 | 1.831952 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3188 | 50 | 4 | 8 | 56 | 37 | Wyoming | Sweetwater County | 43806 | 43806 | 43574 | ... | -17.605427 | 0.502513 | 15.035735 | -4.614071 | -13.140225 | -13.762031 | -18.484164 | -23.515727 | -18.743773 | -17.065527 |
| 3189 | 50 | 4 | 8 | 56 | 39 | Wyoming | Teton County | 21294 | 21298 | 21296 | ... | -1.583464 | -2.575509 | 1.905293 | 23.259519 | 12.508872 | 4.670450 | 1.253214 | -0.171608 | -12.432212 | 1.797445 |
| 3190 | 50 | 4 | 8 | 56 | 41 | Wyoming | Uinta County | 21118 | 21121 | 21089 | ... | -9.181105 | -18.196975 | -4.487730 | -10.870861 | -15.033634 | -10.484550 | -11.485101 | -18.923455 | -13.554993 | -9.181105 |
| 3191 | 50 | 4 | 8 | 56 | 43 | Wyoming | Washakie County | 8533 | 8528 | 8530 | ... | -7.652085 | -12.839390 | -3.084589 | -1.307811 | -18.938032 | 0.000000 | -15.204038 | -15.950541 | -17.246806 | -8.289759 |
| 3192 | 50 | 4 | 8 | 56 | 45 | Wyoming | Weston County | 7208 | 7208 | 7198 | ... | 0.000000 | -9.065551 | -10.549265 | 6.191515 | 0.420345 | 9.201171 | 0.693097 | -36.227798 | -6.910452 | 2.599090 |
3193 rows × 164 columns
Now we will read in the data for population estimates for the years 2001 to 2009. We can then merge our two population DataFrames using the pd.merge function, telling pandas to join rows where the CTYNAME and STNAME columns match.
Finally, we will clean up this DataFrame by removing columns we will not be using and renaming some columns to better names.
pop_2000_2009 = pd.read_csv("co-est00int-tot.csv", encoding="ISO-8859-1")
population_data = pd.merge(population_data, pop_2000_2009, on=["CTYNAME", "STNAME"])
population_data = population_data[
[
"STNAME",
"CTYNAME",
"CENSUS2010POP_x",
"ESTIMATESBASE2010",
"POPESTIMATE2010_x",
"POPESTIMATE2011",
"POPESTIMATE2012",
"POPESTIMATE2013",
"POPESTIMATE2014",
"POPESTIMATE2015",
"POPESTIMATE2016",
"POPESTIMATE2017",
"POPESTIMATE2018",
"POPESTIMATE2019",
"ESTIMATESBASE2000",
"POPESTIMATE2000",
"POPESTIMATE2001",
"POPESTIMATE2002",
"POPESTIMATE2003",
"POPESTIMATE2004",
"POPESTIMATE2005",
"POPESTIMATE2006",
"POPESTIMATE2007",
"POPESTIMATE2008",
"POPESTIMATE2009",
"CENSUS2010POP_y",
"POPESTIMATE2010_y",
]
]
population_data = population_data.rename(columns={"POPESTIMATE2010_x": "POPESTIMATE2010"})
population_data
| STNAME | CTYNAME | CENSUS2010POP_x | ESTIMATESBASE2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | ... | POPESTIMATE2002 | POPESTIMATE2003 | POPESTIMATE2004 | POPESTIMATE2005 | POPESTIMATE2006 | POPESTIMATE2007 | POPESTIMATE2008 | POPESTIMATE2009 | CENSUS2010POP_y | POPESTIMATE2010_y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Alabama | 4779736 | 4780125 | 4785437 | 4799069 | 4815588 | 4830081 | 4841799 | 4852347 | ... | 4480089 | 4503491 | 4530729 | 4569805 | 4628981 | 4672840 | 4718206 | 4757938 | 4779736 | 4785298 |
| 1 | Alabama | Autauga County | 54571 | 54597 | 54773 | 55227 | 54954 | 54727 | 54893 | 54864 | ... | 45909 | 46800 | 48366 | 49676 | 51328 | 52405 | 53277 | 54135 | 54571 | 54632 |
| 2 | Alabama | Baldwin County | 182265 | 182265 | 183112 | 186558 | 190145 | 194885 | 199183 | 202939 | ... | 147957 | 151509 | 156266 | 162183 | 168121 | 172404 | 175827 | 179406 | 182265 | 183195 |
| 3 | Alabama | Barbour County | 27457 | 27455 | 27327 | 27341 | 27169 | 26937 | 26755 | 26283 | ... | 28653 | 28594 | 28287 | 28027 | 27861 | 27757 | 27808 | 27657 | 27457 | 27411 |
| 4 | Alabama | Bibb County | 22915 | 22915 | 22870 | 22745 | 22667 | 22521 | 22553 | 22566 | ... | 21199 | 21399 | 21721 | 22042 | 22099 | 22438 | 22705 | 22941 | 22915 | 22867 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3186 | Wyoming | Sweetwater County | 43806 | 43806 | 43574 | 43986 | 45002 | 45157 | 44948 | 44719 | ... | 37428 | 37450 | 38026 | 38739 | 39749 | 41470 | 42358 | 44133 | 43806 | 43621 |
| 3187 | Wyoming | Teton County | 21294 | 21298 | 21296 | 21414 | 21624 | 22315 | 22773 | 23047 | ... | 18837 | 19066 | 19467 | 19632 | 20014 | 20472 | 20988 | 21232 | 21294 | 21302 |
| 3188 | Wyoming | Uinta County | 21118 | 21121 | 21089 | 20896 | 20996 | 20951 | 20822 | 20763 | ... | 19587 | 19480 | 19470 | 19494 | 19709 | 20171 | 20613 | 21054 | 21118 | 21137 |
| 3189 | Wyoming | Washakie County | 8533 | 8528 | 8530 | 8449 | 8409 | 8413 | 8273 | 8278 | ... | 7988 | 7976 | 7960 | 8022 | 7979 | 8169 | 8229 | 8423 | 8533 | 8556 |
| 3190 | Wyoming | Weston County | 7208 | 7208 | 7198 | 7142 | 7077 | 7136 | 7138 | 7208 | ... | 6578 | 6610 | 6646 | 6594 | 6717 | 7033 | 7133 | 7266 | 7208 | 7182 |
3191 rows × 27 columns
Now that we have our population data and our GDP data, we need to combine them to calculate the GDP per capita. We can do this by applying a function to each row of the GDP DataFrame that calculates the GDP per capita.
This function uses the county name, state name, and year from the row to look up the correct population in the population DataFrame, then divides the GDP by that value.
def get_year_column(year):
    return f"POPESTIMATE{year}"
def lookup_population(r):
    # Look up the population estimate matching this row's state, county, and year.
    populations = population_data[
        (population_data["STNAME"] == r["State"])
        & (population_data["CTYNAME"] == r["County Full Name"])
    ][get_year_column(r["Year"])]
    return populations.iloc[0] if populations.size > 0 else np.nan
def get_gdp_per_capita(r):
    # A missing population (np.nan) propagates through the division, yielding np.nan.
    return r["GDP (Chained $)"] / lookup_population(r)
gdp_data["gdp_per_capita"] = gdp_data.apply(get_gdp_per_capita, axis=1)
gdp_data["population"] = gdp_data.apply(lookup_population, axis=1)
gdp_data
| index | Year | County FIPS | Region | SUB_REGION | State | STATE_ABBR | County | County Full Name | GDP (Chained $) | gdp_per_capita | population | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2001 | 1001 | Southeast | East South Central | Alabama | AL | Autauga | Autauga County | 954588000 | 21265.521620 | 44889.0 |
| 1 | 1 | 2001 | 1003 | Southeast | East South Central | Alabama | AL | Baldwin | Baldwin County | 3929753000 | 27125.128559 | 144875.0 |
| 2 | 2 | 2001 | 1005 | Southeast | East South Central | Alabama | AL | Barbour | Barbour County | 813278000 | 28177.181859 | 28863.0 |
| 3 | 3 | 2001 | 1007 | Southeast | East South Central | Alabama | AL | Bibb | Bibb County | 292693000 | 13919.202967 | 21028.0 |
| 4 | 4 | 2001 | 1009 | Southeast | East South Central | Alabama | AL | Blount | Blount County | 810381000 | 15630.841933 | 51845.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55558 | 55558 | 2018 | 56037 | Rocky Mountain | Mountain | Wyoming | WY | Sweetwater | Sweetwater County | 3836603000 | 89518.946288 | 42858.0 |
| 55559 | 55559 | 2018 | 56039 | Rocky Mountain | Mountain | Wyoming | WY | Teton | Teton County | 2166420000 | 93103.270446 | 23269.0 |
| 55560 | 55560 | 2018 | 56041 | Rocky Mountain | Mountain | Wyoming | WY | Uinta | Uinta County | 906587000 | 44677.064853 | 20292.0 |
| 55561 | 55561 | 2018 | 56043 | Rocky Mountain | Mountain | Wyoming | WY | Washakie | Washakie County | 358104000 | 45461.977910 | 7877.0 |
| 55562 | 55562 | 2018 | 56045 | Rocky Mountain | Mountain | Wyoming | WY | Weston | Weston County | 315885000 | 45621.750433 | 6924.0 |
55563 rows × 12 columns
Before we move on to exploring our data, it is important to make sure we don't have any unexpected missing values. We can see that there are four counties for which we were unable to find a population to calculate GDP per capita. We can manually check these counties in the census data and verify that no population estimate is available.
Since we don't have the population of these counties, we can remove them from the dataset.
gdp_data[gdp_data["gdp_per_capita"].isnull()]["County"].unique()
array(['Kusilvak', 'Oglala Lakota', 'LaSalle', 'Petersburg'], dtype=object)
gdp_data = gdp_data[~(gdp_data["gdp_per_capita"].isnull())]
ax = gdp_data.plot.scatter(x="Year", y="gdp_per_capita")
ax.set_title("GDP Per Capita vs. Time")
ax.set_ylabel("GDP Per Capita")
Text(0, 0.5, 'GDP Per Capita')
Looking at this plot we can clearly see that there is one county with a GDP per capita much higher than the rest of the counties in the country. In fact, if we look at the axis of this plot, the GDP per capita of this county is greater than $60,000,000. This value is so extreme that it is likely an error in our dataset rather than simply an outlier. We can verify this by looking at which county has the maximum GDP per capita.
gdp_data[gdp_data['gdp_per_capita'] == gdp_data['gdp_per_capita'].max()]
| index | Year | County FIPS | Region | SUB_REGION | State | STATE_ABBR | County | County Full Name | GDP (Chained $) | gdp_per_capita | population | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 54161 | 54161 | 2018 | 48301 | Southwest | West South Central | Texas | TX | Loving | Loving County | 9557235000 | 6.457591e+07 | 148.0 |
If we do some research into Loving County, TX, we find that it is a very small county with a population of only 57 as of 2021! It is extremely unlikely that this county actually has a GDP per capita this high, so we can go ahead and remove it from our dataset.
While we are at it, we can use a simple heuristic to remove any other outliers or errors from the data. We can remove a county if

$$\text{gdpPerCapita} > 200000 \;\land\; \text{population} < 10000$$

It is very unlikely that a county this small would have a GDP per capita this high. Additionally, we can remove any county where the GDP per capita is greater than \$400,000. Some quick research shows that New York County has the highest GDP per capita, and it is less than \$400,000.
gdp_data = gdp_data[gdp_data["County Full Name"] != "Loving County"]
gdp_data = gdp_data[(~((gdp_data["gdp_per_capita"] > 200000) & (gdp_data["population"] < 10000))) & (gdp_data["gdp_per_capita"] < 400000)]
ax = gdp_data.plot.scatter(x="Year", y="gdp_per_capita")
ax.set_title("GDP Per Capita vs. Time")
ax.set_ylabel("GDP Per Capita")
Text(0, 0.5, 'GDP Per Capita')
Now the scatter plot looks much better and there are no obvious errors or outliers in the data.
It seems like there may be a positive trend in GDP per capita, but we will need to do some more analysis to confirm this suspicion.
counties = [("Los Angeles County", "CA"), ("Montgomery County", "MD"), ("Prince George's County", "MD"), ("New York County", "NY"), ("San Francisco County", "CA")]
for c, s in counties:
ax = gdp_data[(gdp_data["County Full Name"] == c) & (gdp_data["STATE_ABBR"] == s)].sort_values(by="Year").plot(x="Year", y="gdp_per_capita")
ax.set_title(f"{c}, {s} GDP per capita over time")
plt.show()
For these specific counties, GDP per capita appears to increase over time. The trend seems fairly linear; however, some counties, such as Prince George's and Montgomery, seem to peak somewhere around 2005-2007, with growth slowing after that.
Another thing we can learn from these plots is that they are offset from each other. In other words, even though GDP per capita increases for all counties, some counties start off with a much higher value than others. Our model will need to account for this; we will come back and visualize it in a bit.
We can also look at a chart of the mean GDP per capita over time to verify that this trend holds in general.
ax = gdp_data.groupby("Year")["gdp_per_capita"].mean().plot()
ax.set_title("Average GDP per capita over time")
ax.set_ylabel("Average GDP per capita")
Text(0, 0.5, 'Average GDP per capita')
Next, we will take a look at plots of the GDP per capita of specific geographic regions over time. If we are going to make a model that predicts changes in GDP per capita over time based on geographical regions, we should look at what this data looks like first.
gdp_data["Region"].unique()
array(['Southeast', 'Far West', 'Southwest', 'Rocky Mountain',
'Great Lakes', 'Plains', 'New England', 'Mideast'], dtype=object)
regions = gdp_data["Region"].unique()
for region in regions:
ax = gdp_data[gdp_data["Region"] == region].groupby("Year")["gdp_per_capita"].mean().plot()
ax.set_title(f"{region} mean GDP per capita over time")
ax.set_ylabel("Average GDP per capita")
plt.show()
These plots tell us a couple of different things. First, we can see that the increase in GDP per capita seems more linear in some regions, such as the Mideast, than in others, such as the Southeast. With this information, we can hypothesize that a linear regression model may perform worse than other regression models we may try.
We can also see that the average rate of increase varies by region as well as the average GDP per capita in each region at any given time.
We can use the Plotly Python library to create a map that colors each county on a gradient representing how high its GDP per capita is compared to the rest of the counties in the United States.
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
counties = json.load(response)
# Left-pad the FIPS code with zeros so its string representation is 5 characters.
gdp_data["County FIPS"] = gdp_data["County FIPS"].map(lambda x: str(x).zfill(5))
fig = px.choropleth_mapbox(gdp_data[gdp_data["Year"] == 2018], geojson=counties, locations='County FIPS', color='gdp_per_capita',
mapbox_style="carto-positron",
zoom=2.5, center = {"lat": 37.0902, "lon": -95.7129},
opacity=0.5,
labels={'gdp_per_capita':'GDP Per Capita'})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Looking at this map, it is clear that some counties have a GDP per capita much higher than the surrounding counties. If we explore the map a bit more, we can see that cities in particular have higher GDP per capita than the counties around them. Take Manhattan for example: Manhattan has a GDP per capita of around \$368,000, whereas surrounding counties fall in the range of \$25,000-\$75,000.
Based on this information, it makes sense to add a new feature to our dataset that acts as a baseline GDP per capita for each county. We can use each county's 2001 GDP per capita for this value.
def get_base_gdp(r):
county = gdp_data[gdp_data["County Full Name"] == r["County Full Name"]]
base_gdp = county[(county["State"] == r["State"]) & (county["Year"] == 2001)]["gdp_per_capita"]
base_gdp = base_gdp.iloc[0] if base_gdp.size > 0 else np.nan
return base_gdp
gdp_data["base_gdp"] = gdp_data.apply(get_base_gdp, axis=1)
gdp_data
| index | Year | County FIPS | Region | SUB_REGION | State | STATE_ABBR | County | County Full Name | GDP (Chained $) | gdp_per_capita | population | base_gdp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2001 | 01001 | Southeast | East South Central | Alabama | AL | Autauga | Autauga County | 954588000 | 21265.521620 | 44889.0 | 21265.521620 |
| 1 | 1 | 2001 | 01003 | Southeast | East South Central | Alabama | AL | Baldwin | Baldwin County | 3929753000 | 27125.128559 | 144875.0 | 27125.128559 |
| 2 | 2 | 2001 | 01005 | Southeast | East South Central | Alabama | AL | Barbour | Barbour County | 813278000 | 28177.181859 | 28863.0 | 28177.181859 |
| 3 | 3 | 2001 | 01007 | Southeast | East South Central | Alabama | AL | Bibb | Bibb County | 292693000 | 13919.202967 | 21028.0 | 13919.202967 |
| 4 | 4 | 2001 | 01009 | Southeast | East South Central | Alabama | AL | Blount | Blount County | 810381000 | 15630.841933 | 51845.0 | 15630.841933 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55558 | 55558 | 2018 | 56037 | Rocky Mountain | Mountain | Wyoming | WY | Sweetwater | Sweetwater County | 3836603000 | 89518.946288 | 42858.0 | 104387.354671 |
| 55559 | 55559 | 2018 | 56039 | Rocky Mountain | Mountain | Wyoming | WY | Teton | Teton County | 2166420000 | 93103.270446 | 23269.0 | 83380.903876 |
| 55560 | 55560 | 2018 | 56041 | Rocky Mountain | Mountain | Wyoming | WY | Uinta | Uinta County | 906587000 | 44677.064853 | 20292.0 | 50759.748622 |
| 55561 | 55561 | 2018 | 56043 | Rocky Mountain | Mountain | Wyoming | WY | Washakie | Washakie County | 358104000 | 45461.977910 | 7877.0 | 42887.704512 |
| 55562 | 55562 | 2018 | 56045 | Rocky Mountain | Mountain | Wyoming | WY | Weston | Weston County | 315885000 | 45621.750433 | 6924.0 | 42663.789117 |
55007 rows × 13 columns
Again, we can go through and remove any counties for which we were not able to get a baseline GDP per capita.
gdp_data = gdp_data[~(gdp_data['base_gdp'].isnull())]
Now that we have looked at and explored the data, we can make a model that predicts a county's GDP per capita based on three features: the year, the county's region, and the county's baseline GDP per capita.
We saw during the data exploration that some regions of the United States have GDP trends that are not exactly linear. We will first use a decision tree regressor and then compare it to a linear regression model. If we are correct that there are nonlinear trends in our data, the decision tree regressor should perform better.
In order to use the decision tree regressor, we need to transform our input data a bit. The model expects numerical inputs, but one of our features (Region) is categorical. We will use a ColumnTransformer to transform just the Region column with a OneHotEncoder from the scikit-learn library.
This turns our categorical data into a set of numerical indicator columns that the decision tree model can work with.
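To see what one-hot encoding does, here is a small sketch using pandas' get_dummies on a few region labels drawn from this dataset. (The scikit-learn OneHotEncoder used below produces the equivalent set of 0/1 indicator columns, just as an array rather than a DataFrame.)

```python
import pandas as pd

# A few sample region labels from the Region column of this dataset.
regions = pd.DataFrame({"Region": ["Southeast", "Far West", "Southeast", "Plains"]})

# Each unique category becomes its own 0/1 indicator column,
# so a row's region is encoded as a 1 in exactly one column.
encoded = pd.get_dummies(regions["Region"])
print(encoded)  # 4 rows × 3 columns: Far West, Plains, Southeast
```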
enc = OneHotEncoder()
transform = ColumnTransformer([
('one-hot', enc, ["Region"]),
('pass', 'passthrough', ['Year', 'base_gdp']),
])
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    transform.fit_transform(gdp_data[["Year", "Region", "base_gdp"]]),
    gdp_data["gdp_per_capita"],
    test_size=0.3,
)
clf = tree.DecisionTreeRegressor()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.9119436191700345
After splitting the data into training and test sets, training the model, and evaluating it on the test data, we get an $R^2$ score of around 0.912. This value is close to 1, which indicates that the decision tree model is a good fit for our data.
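For reference, the score returned by clf.score is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. A small sketch with made-up numbers shows the same computation:

```python
import numpy as np

# Hypothetical true values and model predictions, for illustration only.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 29.0, 43.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # → 0.964
```

An $R^2$ of 1 means the predictions match the true values exactly, while 0 means the model does no better than always predicting the mean.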
Now we can create a linear regression model that uses the year and the region features and has another term with the baseline GDP added on.
smf.ols(formula="gdp_per_capita ~ Year*C(Region) + base_gdp", data=gdp_data).fit().summary()
| Dep. Variable: | gdp_per_capita | R-squared: | 0.702 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.702 |
| Method: | Least Squares | F-statistic: | 8076. |
| Date: | Fri, 16 Dec 2022 | Prob (F-statistic): | 0.00 |
| Time: | 14:32:59 | Log-Likelihood: | -5.9822e+05 |
| No. Observations: | 54946 | AIC: | 1.196e+06 |
| Df Residuals: | 54929 | BIC: | 1.197e+06 |
| Df Model: | 16 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | -1.083e+06 | 9.01e+04 | -12.021 | 0.000 | -1.26e+06 | -9.06e+05 |
| C(Region)[T.Great Lakes] | 2.797e+05 | 1.06e+05 | 2.630 | 0.009 | 7.13e+04 | 4.88e+05 |
| C(Region)[T.Mideast] | -4.51e+04 | 1.26e+05 | -0.357 | 0.721 | -2.93e+05 | 2.03e+05 |
| C(Region)[T.New England] | 4.892e+05 | 1.7e+05 | 2.875 | 0.004 | 1.56e+05 | 8.23e+05 |
| C(Region)[T.Plains] | -1.039e+06 | 1.02e+05 | -10.202 | 0.000 | -1.24e+06 | -8.4e+05 |
| C(Region)[T.Rocky Mountain] | 4.924e+04 | 1.21e+05 | 0.407 | 0.684 | -1.88e+05 | 2.86e+05 |
| C(Region)[T.Southeast] | 8.354e+05 | 9.74e+04 | 8.573 | 0.000 | 6.44e+05 | 1.03e+06 |
| C(Region)[T.Southwest] | -4.594e+05 | 1.1e+05 | -4.189 | 0.000 | -6.74e+05 | -2.44e+05 |
| Year | 542.4089 | 44.821 | 12.102 | 0.000 | 454.559 | 630.259 |
| Year:C(Region)[T.Great Lakes] | -140.4507 | 52.921 | -2.654 | 0.008 | -244.177 | -36.725 |
| Year:C(Region)[T.Mideast] | 22.0359 | 62.870 | 0.351 | 0.726 | -101.189 | 145.261 |
| Year:C(Region)[T.New England] | -244.5586 | 84.692 | -2.888 | 0.004 | -410.555 | -78.562 |
| Year:C(Region)[T.Plains] | 519.0740 | 50.704 | 10.237 | 0.000 | 419.693 | 618.455 |
| Year:C(Region)[T.Rocky Mountain] | -24.7407 | 60.230 | -0.411 | 0.681 | -142.792 | 93.310 |
| Year:C(Region)[T.Southeast] | -417.8386 | 48.490 | -8.617 | 0.000 | -512.879 | -322.798 |
| Year:C(Region)[T.Southwest] | 228.7194 | 54.565 | 4.192 | 0.000 | 121.771 | 335.668 |
| base_gdp | 0.9679 | 0.003 | 336.271 | 0.000 | 0.962 | 0.974 |
| Omnibus: | 65098.985 | Durbin-Watson: | 1.899 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 27413462.913 |
| Skew: | 5.827 | Prob(JB): | 0.00 |
| Kurtosis: | 111.803 | Cond. No. | 1.92e+08 |
After fitting our linear regression model, we get an $R^2$ value of 0.702. While this value is not bad, it is significantly lower than the $R^2$ value we got from the decision tree regressor. This confirms our hypothesis that the trends in GDP per capita are not completely linear and that a nonlinear model performs better than a linear one.
Going through the data analysis process in this tutorial has given us many insights into the trends in GDP per capita across the country. In the process, we also learned a lot about the data science process in general: reading, cleaning, and combining data from multiple sources; analyzing and exploring data to gain key information that later helps us create an accurate model; and finally, going through the model creation process twice, comparing two different models, and considering how they performed against our expectations.
All the data and Python notebooks can be found here.